In this tutorial, we will be fitting Bayesian distance sampling-adjusted log-Gaussian Cox process models to the survey data using the inlabru package. These initial models will provide us with baseline estimates of the total number of ‘groups’ present in the domain, and baseline estimates of species-environmental relationships. We will then compare these estimates with those reported from the integrative joint log-Gaussian Cox process models in the third session. The models in the third session will use both the distance sampling survey data and the opportunistic encounters data reported by whale watch companies!
As we mentioned in the previous workshop, we have chosen to simplify the analysis and model only the presence/absence. Whilst some of the data sources we are using recorded the counts of fin whales per encounter, we have discarded this information. Thus, we will only be able to estimate the total number of ‘groups’ of whales within our study region and not the population size.1.
Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.
library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)
Using getwd() check that you are inside the folder ‘DFO_SDM_Workshop_2020’ that we downloaded in the previous tutorial. If not, change this by navigating to the correct folder in the bottom right panel (folder view), opening it, and then clicking “Set as Working Directory” under the tab ‘More.’
Now we can load in the precompiled data and functions
list2env(readRDS('./Data/Modelling_Data.rds'),globalenv())
## <environment: R_GlobalEnv>
source('utility_functions.R')
Finally, let’s turn off all warnings associated with coordinate reference systems.
rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")
The log-Gaussian Cox process (LGCP) models we are interested in require the specification of Gaussian random fields. Adding Gaussian random fields to point process models can become computationally prohibitive. To combat this, we will use two computational tricks. The first involves approximating a stochastic partial differential equation (SPDE) with a Gaussian Markov random field (GMRF) evaluated across a delauney triangulation mesh. This approach was developed by Lindgren and Rue (2011) (Lindgren being an author of the inlabru package). The second trick involves using the Integrated Nested Laplace Approximation (INLA) method for computing approximate posterior distributions Rue, Martino, and Chopin (2009). inlabru simplifies the use of these two computational tricks and provides additional methods for fitting point process models (among other things).
The first step to using this approach is creating the mesh. Good rules of thumb are as follows:
Define maximum triangle lengths no shorter than half of the expected spatial range of your problem. Spatial range here is loosely defined as the ‘distance required between two locations for the whale intensity at these two locations to be approximately independent of each other.’ Note that it is impossible to detect clusters or hotspots with sizes smaller than the maximum length of triangles used.
Do not define the minimum triangle lengths too small. This could lead to very large meshes! This is specified with the argument cutoff.
Use a minimum triangle angle of 21 degrees.
We define a first mesh using the function inla.mesh.2d from the INLA package. We choose a minimum triangle edge length of 18km, a maximum triangle edge length of 45km, a minimum angle of 25 degrees and we specify that we want a hard boundary equal to the coastline (as defined by the SpatialPolygonDataFrame Domain). Note that we can plot meshes using gg as before2.
mesh_land <- inla.mesh.2d(boundary = Domain,
min.angle = 25,
max.edge = 45,
cutoff = 18)
ggplot() + gg(mesh_land) + gg.spatiallines_mod(Effort_survey,colour='red')
Note that spatial correlations < 18km will not be detectable! We can print the number of mesh vertices (directly related to the length of time required later to fit the model).
mesh_land$n # lots of triangle vertices can lead to slow computation time. 453 good
## [1] 453
We have a potential serious issue with this mesh. The coastline is jagged, which could lead to numerical instabilities to occur later. One easy solution is to buffer and/or smooth the coastline shapefile, making sure that all survey lines and sightings fall within the new modified shapefile. We did this in the exploratory analysis - creating a Domain_restricted object! We plot this now as a reminder!
ggplot() + gg(Domain_restricted) +
gg.spatiallines_mod(Effort_survey,colour='red')
Using this new buffered and smoothed domain/study area shapefile (i.e. Domain_restricted), we create a new mesh below named mesh_land.
mesh_land <- inla.mesh.2d(boundary = Domain_restricted,
min.angle = 25,
max.edge = 45,
cutoff = 18)
mesh_land$crs <- Can_proj
mesh_land$n
## [1] 498
ggplot() +
gg(mesh_land) +
gg.spatiallines_mod(Effort_survey,colour='red')
One may be tempted to define the mesh within a much larger convex hull around the coastline, but see Additional tips for explanation of why this is an inadequate approach.
Next, we need to define our Gaussian random field model and its priors. To do this, we use the function inla.spde2.pcmatern from the INLA package. Note that the arguments prior.sigma and prior.range define arguments for the penalized complexity (PC) prior of the random field developed by Simpson et al. (2017) and Fuglstad et al. (2019).
We are required to define PC priors, on the spatial range \(\rho\) (the wiggliness) and marginal standard deviation \(\sigma\) (the magnitude) of the field. PC priors take the form of prior probabilistic statements on quantiles, providing a user-friendly way of defining sensible, interpretable priors.
Before defining our priors, let’s explore how the Gaussian random field is affected by the range and marginal standard deviation. Here are example plots.
The above plot is a single realisation of a Gaussian random field on the unit square with standard deviation equal to 1 and range equal to 0.7. The next plot shows a second realisation, to make clear that a very large number of behaviours and patterns can be generated with a given value of the parameters.
Below, we decrease the range from 0.7 to 0.01. Notice how the field becomes much more ‘wiggly!’
Finally, we increase the marginal standard deviation to 2 below. Notice how the range of values of the field increases!
If the effects of the two parameters on the nature of the random field is still unclear, we provide an additional simulation script in the additional comments at the end of this document to let you explore the effects of the two parameters.
With the understanding of what the SD and range define, we are asked to define the following two probabilistic statements (remember we are working in units of km):
\[P(\rho < \rho_0) = p_{\rho} \\ P(\sigma > \sigma_0) = p_{\sigma}.\] The goal is to specify the lower tail quantile \(\rho_0\) and probability \(p_{\rho}\) for the range parameter and the upper tail quantile \(\sigma_0\) and probability \(p_{\sigma}\) for the marginal standard deviation. We choose these 4 parameters to help ensure that the field does not become to ‘complex’; specifically we want to avoid the field becoming too ‘wiggly’ or ‘large.’ The PC prior shrinks the spatial field towards the base model of no spatial variance and variance zero, hence the name ‘Penalized Complexity.’ Note that as the complexity of the field increases, so does the risk of overfitting our data!
We now define our model, with \(\rho_0\), \(p_{\rho}\), \(\sigma_0\), and \(p_{\sigma}\) fixed at 35km, 0.01, 2, and 0.01 respectively. Roughly speaking, our priors state that the range is likely larger than 35km and the standard deviation is smaller than 2 whale encounters. JOE: check this simplified interpretation I have added.
# First define the SPDE model
matern_land <- inla.spde2.pcmatern(mesh_land,
prior.sigma = c(2, 0.01),
prior.range = c(35, 0.01))
Next, we must define our detection probability function. As we discussed in the previous tutorial, we will use a half-normal detection probability function. The object we will define will take the form of a base R function.
# Define a half-normal detection probability function (code from inlabru tutorials)
log_hn = function(distance, lsig){
-0.5*(distance/exp(lsig))^2
}
Note that we define the half-normal function on the log scale. In particular we have defined:
\[\textrm{log}(p(d)) = \frac{-d^2}{2\sigma^2}\\ \rightarrow (p(d)) = exp\left(\frac{-d^2}{2\sigma^2}\right) \]
the parameter lsig is equal to \(log(\sigma)\). Defining the log of the parameter ensures that our estimates of \(\sigma\) are strictly positive.
Next, we need to describe how inlabru should combine the components of the model: the Gaussian random field (GRF), the detection probability function parameter lsig, and the intercept.
We will give the GRF the name ‘mySpatial.’ This name is arbitrary and can be changed. To define ‘mySpatial’ as a GRF, we will point it to the model object we defined earlier (matern_land) and we will tell inlabru that the coordinates of the points are the locations at which we wish to evaluate the field at. coordinates is a special variable name reserved for the spatial coordinates of a SpatialPointsDataFrame object. The encounters data we will be using are all SpatialPointsDataFrame objects.
The scalar terms Intercept and the detection function parameter lsig are given the special arguments 1. This tells inlabru to create a scalar random variable. These will be available for every data point.
# Define the components of the model
cmp_land1 <- ~ mySpatial(main = coordinates, model = matern_land) +
lsig(1) + Intercept(1)
Note that the components do not define the formula of the model. We must define the formula separately. The formula describes how these components are combined to form the linear predictor. For our log-Gaussian Cox process, the linear predictor will be the model for log \(\lambda_{obs}\).
We now define the formula. On the left hand side of the formula are the response variables. We have two for a distance sampling dataset. The first, are the collections of locations \(Y\) (i.e. the coordinates). The second are the recorded perpendicular distances from the aircraft.
On the right hand side of the formula, we have the GRF mySpatial. Next, we include the scalar parameter lsig. Unlike mySpatial, we include lsig in the linear predictor as an argument to the detection probability function log_hn. Notice that the log_hn function is evaluated at the response variable distance. Finally, we also include the variable Intercept.
Additionally, the offset of log(2) is required due to the unknown direction of the detections (detections can occur either side of the aircraft). For now, we ignore covariates.
formula_1 = coordinates + distance ~ mySpatial +
log_hn(distance, lsig) +
Intercept + log(2)
For clarity, the linear predictor defined in formula_1 is:
\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + Z(s) \\ \textrm{log}(p(d)) = \frac{-d^2}{2\sigma^2}.\]
Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:
\[\lambda_{obs}(s) = 2 \lambda_{true}(s) p(d).\]
Note that \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig. We multiply \(\lambda_{true}(s) p(d)\) by 2 since detections can occur either side of the aircraft.
Defining the components and the formula separately in two stages is what gives inlabru its generalisibility. Parameters can be included within a model non-linearly, unlike most other R packages! For example, lsig was included nonlinearly through our custom function log_hn! Behind the scenes, for nonlinear models, inlabru linearises the model via a Taylor series expansion.
Furthermore, components and formulae can be stacked together for datasets from differing sources, allowing for integrative joint models to be fit with relative ease! For example, the same components can be shared across different model, but take different forms for each (e.g. Intercept in model 1, Intercept^2 in model 2, etc.,)!
For detailed additional information on the types of model components allowed by inlabru and the linearisation procedure performed, call ?component and here.
Before fitting the model, we set the following global options for inlabru.
bru_options_set(
bru_verbose = TRUE,
verbose = TRUE,
control.inla=list(int.strategy='eb'),
num.threads=2
)
We want both INLA and inlabru to display all messages about the convergence, because these messages can help with diagnosing any issues with the model fitting. To do so, we set verbose arguments to TRUE.
For the purposes of this workshop, we also tell INLA to perform an empirical Bayes analysis (control.inla=list(int.strategy='eb')) instead of a full Bayesian analysis. This speeds up computation dramatically, but fails to propagate the full uncertainties associated with the hyperparameters into inference. This is because these empirical Bayes analyses condition on the single ‘most-likely’ model defined by the posterior modes of the hyperparameters. In contrast, a fully-Bayesian analysis performs Bayesian model averaging by averaging across (possibly infinitely many) models defined by all possible combinations of hyperparameter values weighted by their prior probabilities. The discrepancies between these two strategies can be large and in a publication, int.strategy should be set to auto (the default setting).
The number of threads (similar to cores) can be chosen. The higher the number, the faster the inference due to parallelisation. To see how many threads are available for use, run detectCores(). Always save at least 1 core for running background tasks!
We are ready! We fit our first log-Gaussian Cox process model in inlabru!
We will fit the model using the like and bru approach because it allows to easily combine different data sources or data streams into an integrative joint models.
First, we define our likelihood object using the function like. We specify that the SpatialPointsDataFrame Sightings_survey is our data. Remember - it is crucial that the data are of type SpatialPointsDataFrame for the special coordinates variable to work. Next, we specify the domains for both our response variables coordinates and distance. We point coordinates to the spatial mesh mesh_land and point the distance variable to a 1D mesh that we define. Remember that we have set the upper limit to the distance equal to 2km.
Next, we point the likelihood to the formula object we created earlier that defines the linear predictor and specify that a Cox process model is being fitted by specifying family='cp'. Note that a HUGE library of likelihoods are available. Calling inla.list.models() provides a list of every possible model, prior and latent effect available. Finally, we provide a sensible (negative) initial value for lsig.
fit = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_1,
family = 'cp',
options = list(bru_initial = list(lsig = -1)))
Next, we call the function bru. This function combines all the likelihood objects together with the specified components and performs model inference. Note that we only have a single likelihood object fit, with components cmp_land1.
Running this function may take 2-5 min.
# Warning: can be slow!
fit <- bru(fit, components = cmp_land1)
Is this a good fit? One simple way of assessing how well a model fits the data is to assess it’s DIC value. DIC is similar to AIC/BIC in that it trades of goodness-of-fit with a penalty for the complexity of the model under the Occam’s Razor principle of ‘simple explanations trump complicated ones when both predict outcomes equally well.’ Whilst a lone DIC value is somewhat meaningless, we can compare the DIC values across competing models fit to the same data. The lower the DIC score, the better the model fit. A similar measure is the Watanabe-Akaike information criterion (WAIC). We will use both to judge model fit.
fit$dic$dic #DIC is 840. This is our benchmark
## [1] 840.3133
summary(fit) # WAIC 835
## inlabru version: 2.2.5
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + log_hn(distance, lsig) +
## Intercept + log(2)
## Time used:
## Pre = 6.6, Running = 8.02, Post = 0.861, Total = 15.5
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## lsig 0.063 0.152 -0.250 0.068 0.346 0.078 0
## Intercept -6.839 0.354 -7.539 -6.838 -6.149 -6.835 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 191.411 83.279 84.282 173.025 403.46 143.488
## Stdev for mySpatial 0.846 0.202 0.506 0.828 1.29 0.794
##
## Expected number of effective parameters(stdev): 25.40(0.00)
## Number of equivalent replicates : 472.58
##
## Deviance Information Criterion (DIC) ...............: 840.31
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 23.11
##
## Watanabe-Akaike information criterion (WAIC) ...: 835.34
## Effective number of parameters .................: 17.52
##
## Marginal log-Likelihood: -437.56
## Posterior marginals for the linear predictor and
## the fitted values are computed
So, our benchmark DIC and WAIC values for comparison are 840 and 835 respectively.
Let’s turn off the (slightly overwhelming) verbose messages from INLA for the remainder of the workshop.
bru_options_set(
bru_verbose = TRUE,
verbose = FALSE,
control.inla=list(int.strategy='eb'),
num.threads=2
)
Now that we have fit our log-Gaussian Cox process model, what can we learn? What can we say about the whale intensity?
inlabru provides many user-friendly functions for answering these questions. Three such functions are spde.posterior, predict and pixels. Let’s see them in action.
As a first objective, let’s plot the predicted whale intensity across the domain. This plot can help to visually assess where the model predicts ‘hotspots’ or areas of high whale activity exist. The following workflow can help to make professional plots easily:
pixels.# Define pixels for plotting
plot_pixels <- pixels(mesh_land, mask = Domain)
# mask = Domain ensures we only predict within the waters!
predict. The package inlabru (and INLA) allows for approximate Monte Carlo samples of each random component to be generated from the posterior distributions of the model. The function predict then computes the user-specified function for each sample before taking the empirical mean, standard deviation, quantiles, etc., of these values. This approach allows the posterior distribution of arbitrarily complex nonlinear functions of parameters to be investigated.For the purposes of this workshop, we are going to use only 20 Monte Carlo samples. In practice, to reduce Monte Carlo error, this number may need to be >1000. The Monte Carlo standard error should always be reported. We set the seed so that the results are reproducible.
# Set random seed
seed <- c(12345)
# Predict the intensity defined by the function exp(mySpatial + Intercept) across the pixels
pred_int <- predict(fit, plot_pixels, ~ exp(mySpatial + Intercept), n.samples = 20, seed=seed)
# What summary statistics have been computed for the intensity?
names(pred_int)
## [1] "mean" "sd" "q0.025" "median" "q0.975" "smin" "smax" "cv"
## [9] "var"
Notice how lots of useful summary statistics are computed by default using the predict function!
gg. We will plot both the posterior mean (representing the predicted whale intensity) and the posterior standard deviation (representing the uncertainty around the prediction).# Define a 'pretty' colour palette
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
limits = range(...))
}
# Plot the posterior mean intensity
ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int@data$mean)
# Plot the posterior SD (note the index [2])
ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int@data$sd)
From these plots, we see 2 interesting features. First, a very low whale intensity is predicted in the regions frequently visited by the whale-watching companies. Second, hotspots are identified in the Western region that had lots of survey effort focused in it, and in the center of the study region.
Next, let’s look at the posterior distributions of the spatial range and the marginal variance of the random field. Remember: the spatial range (in km) is estimating the ‘approximate distance required between two points for the values of the intensity their to be approximately independent.’ For this, we use the functions spde.posterior and plot.
spde.range <- spde.posterior(fit, "mySpatial", what = "range")
plot(spde.range)
spde.var <- spde.posterior(fit, "mySpatial", what = "variance")
plot(spde.var)
We can see that the model is estimating a spatial field with a reasonably large spatial range (~200km) and small variance (<1). No evidence of severe overfitting is shown (a very large variance with a tiny spatial range can be suspicious).
We can plot the estimated detection probability function, with uncertainty intervals. For this, we use predict once again, with one slight modification. Instead of using pixels to define a set of SpatialPixels to plot over, we must manually specify a data.frame object with the desired values of distance to predict across. Then, we can use the function plot as before. Note: I have had some issues with using the predict function on non-spatial data.frames. If you get an error message, see Additional tips.
distdf <- data.frame(distance = seq(0, 2, length = 100))
dfun <- predict(fit, distdf, ~ exp(log_hn(distance, lsig)), n.samples = 20, seed=seed)
plot(dfun)
This shows the probability of detecting a group (given that it is at the ‘surface’), as a function of distance from the vessel. The posterior mean and 95% credible intervals are shown. Note that for a publication, n.samples may need to be much higher. To estimate the Monte Carlo standard error at a distance 1, we can do the following:
dfun$sd[50]/sqrt(20) #sqrt(20) by central limit theorem
## [1] 0.01710149
This is huge!
inlabru provides functions for predicting the total number of groups! That is, we can investigate the posterior distribution for the expected number of groups of whales under the following assumptions:
every recorded encounter was with a single group
all groups can be spotted at distance 0m!
No availability bias (perhaps whales are diving and are missed)
No influence of group size on detection probability
Sufficient mixing time has been allowed between encounters.
Under these (flawed) assumptions, we predict the total number of groups falling within the Domain. To achieve this, we use the function ipoints. This function sets a series of integration points with weights (areas), to allow us to approximate various integrals of interest with a Riemann sum approximation. For example, we can estimate the integral of the true intensity surface across the domain to obtain estimates of the total number of groups as follows: \[\int_\Omega \lambda_{true}(s) ds \approx \sum_{A_i} \hat{\bar{\lambda}}_{true}(s_i) |A_i|,\]
where \(\hat{\bar{\lambda}}_{true}(s_i)\) are a single sample of values of \(\lambda_{true}\) evaluated at the locations \(s_i\) defined by ipoints, with weights \(|A_i|\). By generating a large number of Monte Carlo samples of \(\hat{\bar{\lambda}}_{true}(s_i)\), we are able to compute any quantity of interest for the total group size, including the mean, standard deviation, etc.
Note that ipoints creates a SpatialPointsDataFrame object with a single named variable weight containing the area of each region corresponding to each integration point.
predpts <- ipoints(Domain, mesh_land)
head(predpts$weight) # these are the areas of the regions corresponding each point.
## [1] 22.639690 77.300268 65.187338 5.828893 9.064174 169.091984
Then, we use predict, with the user-specified function equal to a weighted sum of the intensity. Notice that we have omitted the detection probability function log_hn. Thus, we are predicting across \(\Omega\), assuming perfect detectability occurs at a distance of 0 metres.
# Define the weighted sum function
Lambda <- predict(fit, predpts, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)
Lambda
## mean sd q0.025 median q0.975 smin smax cv
## 1 557.3873 98.42689 369.2358 586.4482 690.033 367.518 702.9996 0.1765862
## var
## 1 9687.853
There are estimated to be approximately 557 groups with a standard of approximately 98.
There remains a potential problem with our inference. If we look back at the plot of our domain, we notice that it extends in the South-Western direction beyond our survey tracklines about 45km. Thus, there is a risk of extrapolation error! Ideally, we should restrict our predictions to lie ‘close’ to our effort! This will be especially true when we start to include environmental covariates as the risk of extrapolating predictions beyond the range of observed covariates can be high.
Part A. Plot a map of the whale intensity, but colour grey all pixels that lie over 20km away from the nearest survey trackline.
Hint 1: To extract a subset of a SpatialPoints object named X that lies within a 20km distance of another Spatial object named Y, we run the following code:
X_subset < X[which(apply(gWithinDistance(X,Y,dist = 20,byid = T),2,sum)>0),]
Hint 2: We will need to convert the SpatialPixels intensity object into a SpatialPoints before subsetting. To convert a SpatialPixels object named A to a SpatialPoints object named B run the following:
B <- as(A, 'SpatialPointsDataFrame')
If you get really stuck, feel free to show the answer code!
# Convert the SpatialPixels into a SpatialPoints object before subsetting
pred_int_points <- as(pred_int,'SpatialPoints')
# Subset to 20 km from Effort lines - can take ~1 minute
pred_int_restricted <-
pred_int[which(apply(gWithinDistance(pred_int_points, Effort_survey,dist = 20,byid = T),2,sum)>0),]
# Plot the posterior mean
ggplot() + gg(pred_int_restricted[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int_restricted@data$mean)
# Plot the standard deviation
ggplot() + gg(pred_int_restricted[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(pred_int_restricted@data$sd)
Part B. Estimate the total number of whales in the region lying within 20km of the nearest trackline.
If you get really stuck, feel free to show the answer code!
# Subsetting the ipoints that contain the weights (area)
predpts_restricted <- predpts[
which(apply(gWithinDistance(predpts,Effort_survey,dist = 20,byid = T),2,sum)>0),]
# Using predict to estimate the number of groups within the restricted region
Lambda_restricted <- predict(fit, predpts_restricted, ~ sum(weight * exp(mySpatial + Intercept)), n.samples = 20, seed=seed)
Lambda_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 466.7425 81.49448 321.0557 479.8608 566.4465 317.1937 572.5899 0.1746026
## var
## 1 6641.35
You should get a smaller estimate for the total number of groups of approximately 500 and a smaller, but still large, standard deviation of approximately 80. Don’t worry if your answers are slightly different - we’ve only used 20 samples! For the remainder of these workshops, when estimating the number of groups of fin whales, we will use the restricted domain.
Adding environmental covariates to models is made easy in inlabru. All that is required is that each covariate is converted to class SpatialPixelsDataFrame and that the covariate data are complete. By ‘complete,’ we mean that a non-missing value is available at every point within the region defined by the mesh boundary.
Let’s add the log_Slope covariate. To do this, we need to redefine the components and formula. First we define the updated components.
cmp_land2 <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + log_Slope(main = log_Slope, model='linear') +
lsig(1)
Notice that we have defined a new variable called log_Slope. Using the argument main, we point the variable to the SpatialPixelsDataFrame object log_Slope. log_Slope defines the product of a parameter (e.g. \(\beta_{SST}\)) and the value of the covariate log Slope evaluated at the correct spatial location (e.g. \(X(s)\)). Finally, we tell inlabru that we are fitting a linear effect.
Next, we redefine our new formula object:
formula_2 = coordinates + distance ~ mySpatial +
Intercept + log_Slope +
log_hn(distance, lsig) +
log(2)
The only change here, is the addition of the linear effect log_Slope to the linear predictor. For clarity, the linear predictor defined in formula_2 is:
\[\textrm{log}(\lambda_{true}(s)) = \beta_0 + \beta_1 X(s) + Z(s) \\ \textrm{log}(p(d)) = \textrm{log}(2) + \frac{-d^2}{2\sigma^2}.\]
Once again, log_Slope denotes the product terms \(\beta_1 X(s)\), not the parameter \(\beta_1\). The name of \(\beta_1\) is stored internally as log_Slope_latent. More on this later!
Thus, the expected encounter rate of an animal at location \(s\), a distance \(d\) away from the aircraft is:
\[\lambda_{obs}(s) = 2 \lambda_{true}(s) p(d).\]
Note that \(Z(s)\) is the value of the GRF mySpatial evaluated at position \(s\), and \(\sigma\) is the exponential of lsig.
Now we can fit the model using the like/bru approach:
Step 1, define the likelihood
fit2 = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_2,
family = 'cp')
Step 2, fit the model with bru.
This will take 2-5 min to run.
# Slow to run bru
fit2 <- bru(fit2, components = cmp_land2)
## iinla: Iteration 1 [max:10]
## iinla: Iteration 2 [max:10]
## iinla: Max deviation from previous: 9.16% of SD [stop if: <1%]
## iinla: Iteration 3 [max:10]
## iinla: Max deviation from previous: 2.29% of SD [stop if: <1%]
## iinla: Iteration 4 [max:10]
## iinla: Max deviation from previous: 9.34% of SD [stop if: <1%]
## iinla: Iteration 5 [max:10]
## iinla: Max deviation from previous: 11.3% of SD [stop if: <1%]
## iinla: Iteration 6 [max:10]
## iinla: Max deviation from previous: 7.6% of SD [stop if: <1%]
## iinla: Iteration 7 [max:10]
## iinla: Max deviation from previous: 4.46% of SD [stop if: <1%]
## iinla: Iteration 8 [max:10]
## iinla: Max deviation from previous: 4.69% of SD [stop if: <1%]
## iinla: Iteration 9 [max:10]
## iinla: Max deviation from previous: 0.96% of SD [stop if: <1%]
## iinla: Convergence criterion met, running final INLA integration with known theta mode.
## iinla: Iteration 10 [max:10]
We can then explore the results.
fit2$dic$dic # DIC is 841 - marginally worse with log_slope included
## [1] 841.112
summary(fit2) # WAIC 837 - marginally worse
## inlabru version: 2.2.5
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + log_Slope +
## log_hn(distance, lsig) + log(2)
## Time used:
## Pre = 8.52, Running = 7.72, Post = 1.59, Total = 17.8
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -6.944 0.376 -7.69 -6.942 -6.213 -6.937 0
## log_Slope -0.191 0.134 -0.46 -0.189 0.068 -0.185 0
## lsig 0.062 0.152 -0.25 0.067 0.346 0.078 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 210.862 98.988 87.82 187.953 464.77 152.477
## Stdev for mySpatial 0.812 0.208 0.47 0.791 1.28 0.751
##
## Expected number of effective parameters(stdev): 23.83(0.00)
## Number of equivalent replicates : 503.63
##
## Deviance Information Criterion (DIC) ...............: 841.11
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 21.82
##
## Watanabe-Akaike information criterion (WAIC) ...: 837.22
## Effective number of parameters .................: 17.34
##
## Marginal log-Likelihood: -441.97
## Posterior marginals for the linear predictor and
## the fitted values are computed
So, we find that adding log_Slope to the model leads to a worse model fit, as judged by DIC and WAIC. In the paper that developed the DIC method, Spiegelhalter et al. (2002) had the following to say: “Burnham and Anderson (1998) suggested models receiving AIC within 1–2 of the ‘best’ deserve consideration, and 3–7 have considerably less support: these rules of thumb appear to work reasonably well for DIC…”
Thus, given that we detect a difference in DIC of around 2, we conclude that the model without log_Slope is the preferred model, but that the model with log_Slope should also be considered. By ‘considered,’ we mean that predictions and inference should be compared across the models.
Furthermore, inspecting the output from summary(), we see that the 95% credible intervals slightly cover zero. Thus, we detect a marginally ‘significant’ negative effect of log_Slope. That is, we have some evidence to suggest that the whales prefer shallower waters, after controlling for effort.
Fit a model, with a quadratic term for log_Slope. To acheive this, include log_Slope_sq alongside log_Slope. Does this improve the model fit? Are either of the variables (the linear or quadratic terms) found to be significantly associated with the whale intensity?
Hint: Remember the components-like-bru stages.
Part A. Define the model components and formula.
cmp_land3 <- ~ mySpatial(main = coordinates, model = matern_land) +
Intercept(1) + log_Slope(main = log_Slope, model='linear') +
log_Slope_sq(main = log_Slope_sq, model='linear') +
lsig(1)
formula_3 = coordinates + distance ~ mySpatial +
Intercept + log_Slope +
log_Slope_sq +
log_hn(distance, lsig) +
log(2)
Part B. Define the likelihood with like.
fit3 = like(data = Sightings_survey,
samplers = Effort_survey,
domain = list(
coordinates = mesh_land,
distance = INLA::inla.mesh.1d(seq(0, 2, length.out = 30))),
formula = formula_3,
family = 'cp')
Part C. Fit the model with bru.
fit3 <- bru(fit3, components = cmp_land3)
fit3$dic$dic # 842 - marginally worse again - still within 2 of best
## [1] 842.5264
summary(fit3) # 839 - marginally worse again - still worth considering
## inlabru version: 2.2.5
## INLA version: 21.01.08-1
## Components:
## mySpatial: Model types main='spde', group='exchangeable', replicate='iid'
## Intercept: Model types main='linear', group='exchangeable', replicate='iid'
## log_Slope: Model types main='linear', group='exchangeable', replicate='iid'
## log_Slope_sq: Model types main='linear', group='exchangeable', replicate='iid'
## lsig: Model types main='linear', group='exchangeable', replicate='iid'
## Likelihoods:
## Family: 'cp'
## Data class: 'SpatialPointsDataFrame'
## Predictor:
## coordinates + distance ~ mySpatial + Intercept + log_Slope +
## log_Slope_sq + log_hn(distance, lsig) + log(2)
## Time used:
## Pre = 8.69, Running = 7.02, Post = 1.6, Total = 17.3
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -6.917 0.397 -7.703 -6.915 -6.142 -6.912 0
## log_Slope -0.223 0.164 -0.569 -0.214 0.077 -0.197 0
## log_Slope_sq -0.030 0.083 -0.207 -0.025 0.120 -0.016 0
## lsig 0.062 0.152 -0.250 0.067 0.346 0.077 0
##
## Random effects:
## Name Model
## mySpatial SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Range for mySpatial 216.458 103.240 89.094 192.369 481.11 155.335
## Stdev for mySpatial 0.824 0.213 0.473 0.803 1.30 0.762
##
## Expected number of effective parameters(stdev): 24.77(0.00)
## Number of equivalent replicates : 484.47
##
## Deviance Information Criterion (DIC) ...............: 842.53
## Deviance Information Criterion (DIC, saturated) ....: NaN
## Effective number of parameters .....................: 22.59
##
## Watanabe-Akaike information criterion (WAIC) ...: 838.88
## Effective number of parameters .................: 18.25
##
## Marginal log-Likelihood: -448.05
## Posterior marginals for the linear predictor and
## the fitted values are computed
The quadratic model fit by inlabru appears to be a slightly worse fit compared with the linear fit. The DIC and WAIC values have slightly increased and the quadratic term is not found to be ‘significant’ with respect to its 95% credible intervals (they cover 0).
So, we ultimately find that the three models (the one without log_Slope (fit), with a linear effect of log_Slope (fit2), and with a quadratic effect of log_Slope (fit3)) all offer competing descriptions of the survey data since their DIC/WAIC values are very similar (within 3). In the interest of time, we do not consider the results from the quadratic model further.
The DIC values very similar for fit and fit2 - the models with and without log_Slope included. Let’s see how the predictions and parameter estimates change under the second model. Are the conclusions drawn from the two models consistent?
plot_pixels2 <- pixels(mesh_land, mask = Domain)
pred_int2 <- predict(fit2, plot_pixels2,
~ exp(mySpatial + Intercept + log_Slope), n.samples = 20, seed=seed)
multiplot(ggplot() + gg(pred_int2[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate Model Mean'),
ggplot() + gg(pred_int2[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate Model SD'),
ggplot() + gg(pred_int[1]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$mean,pred_int@data$mean)) + ggtitle('Covariate-Free Model Mean'),
ggplot() + gg(pred_int[2]) + gg(Domain) + gg.spatiallines_mod(Effort_survey, colour='red') +
colsc(c(pred_int2@data$sd,pred_int@data$sd)) + ggtitle('Covariate-Free Model SD'),
layout = matrix(c(1:4),nrow=2,ncol=2,byrow = F))
# Look at parameters of spde model
spde.range2 <- spde.posterior(fit2, "mySpatial", what = "range")
spde.var2 <- spde.posterior(fit2, "mySpatial", what = "variance")
multiplot(plot(spde.range2)+ylim(range(spde.range$pdf)),plot(spde.range)+xlim(range(spde.range2$range)),
plot(spde.var2)+ylim(range(spde.var$pdf)),plot(spde.var)+xlim(range(spde.var2$variance)),
layout = matrix(1:4,2,2,byrow = F))
Next, we use predict and plot to display the estimated effect of log_Slope on the intensity. Note that the effect is nonlinear on the intensity scale, as the covariate is specified to be linear on the log intensity scale.
# plot the covariate effects
Slopedf <- data.frame(log_Slope = seq(from=min(log_Slope@data$log_Slope),
to=max(log_Slope@data$log_Slope),
length.out = 100))
Slopepred <- predict(fit2, Slopedf, ~ exp(log_Slope), n.samples=40, seed=seed)
plot(Slopepred)
# Manually if the predict function breaks
#samples <- generate(fit2, n.samples = 20,, seed=seed)
# depthpred <- sapply(samples,FUN = function(x){return(depthdf$log_Depth*x$log_Depth)})
# ggplot(data.frame(mean=apply(depthpred,1, mean),
# LCL=apply(depthpred,1, quantile, probs=c(0.025)),
# UCL=apply(depthpred,1, quantile, probs=c(0.975)),
# logdepth=seq(from=min(log_Depth@data$log_Depth),
# to=max(log_Depth@data$log_Depth),
# length.out = 100)),
# aes(y=mean,x=logdepth,ymax=UCL,ymin=LCL)) +
# geom_line() + geom_ribbon(alpha=0.4)
Note that the credible intervals cover a large range of effect sizes! There is a large amount of uncertainty with the effect of log Slope! This is to be expected with a small dataset!
Finally, we plot the estimated number of groups present in the restricted domain (restricted to points lying within 20km of the nearest survey trackline). We then plot the estimated detection function. Notice that the predict function fails to work for the variable distance! This is a bug in inlabru that will hopefully be corrected for soon. In the meantime, we manually extract the results by sampling all the components using the function generate and then manually combining them using sapply. Note that the variables LCL and UCL define a symmetric 95% credible interval.
# Plot detection probability function
distdf <- data.frame(distance = seq(0, 2, length = 100))
#dfun2 <- predict(fit2, distdf, ~exp(log_hn(distance,lsig)),n.samples = 20)
# If this calls an error run below
samples <- generate(fit2, n.samples = 20, seed=seed)
dfun2 <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$distance,x$lsig)))})
ggplot(data.frame(mean=apply(dfun2,1, mean),
LCL=apply(dfun2,1, quantile, probs=c(0.025)),
UCL=apply(dfun2,1, quantile, probs=c(0.975)),
distance=distdf$distance),
aes(y=mean,x=distance,ymax=UCL,ymin=LCL)) +
geom_line() + geom_ribbon(alpha=0.4)
# We can look at the posterior for expected number of groups of whales:
# Avoid extrapolation
Lambda2_restricted <- predict(fit2, predpts_restricted,
~ sum(weight * exp(mySpatial + Intercept +
log_Slope)),
n.samples=20, seed=seed)
Lambda2_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 449.2228 68.57342 347.7188 416.2225 572.5602 318.6041 593.7964 0.152649
## var
## 1 4702.313
Lambda_restricted
## mean sd q0.025 median q0.975 smin smax cv
## 1 466.7425 81.49448 321.0557 479.8608 566.4465 317.1937 572.5899 0.1746026
## var
## 1 6641.35
We indeed see very similar estimates for the total number of groups, with overlapping credible intervals. Remember that in a publication we would draw far more than 20 samples! There is some evidence to suggest the uncertainty in the estimates is reduced when we include the log_Slope covariate (we see a smaller standard deviation).
In this session, we have learned how to fit log-Gaussian Cox process models to survey datasets that followed distance sampling protocols within a Bayesian framework. We have seen how to compare models, how to predict intensity across space, and how to estimate the total number of groups.
Coming up next, we will see how to incorporate presence-only datasets into our models. Crucially, we will find that by incorporating presence-only datasets into our models, we may be able to uncover previously hidden species-environment relationships, and use these to improve the resolution of our maps of whale intensity!
An alternative ‘solution’ to the jagged mesh problem is to define the mesh within a much larger convex hull around the coastline. Unfortunately, this approach will not take into account the coastline and hence will measure spatial ‘distance’ as euclidean distance and not ‘ocean-distance.’ One such mesh is defined below:
# Compare with a mesh that ignores the land barrier
mesh_noland <- inla.mesh.2d(boundary = Domain_restricted,
min.angle = 25,
max.edge = c(45,60),
cutoff = 30,
offset = c(10,20))
mesh_noland$crs <- Can_proj
plot(mesh_noland)
mesh_noland$n
## [1] 424
We ignore this mesh for the remainder of this session, since it fails to account for land as a barrier.
predict on non-spatial data.framesI have had some issues with using the predict function on non-spatial data.frames. If you get an error message, try using the generate - sapply method shown below.
samples <- generate(fit, n.samples = 20, seed=seed)
distpred <- sapply(samples,FUN = function(x){return(exp(log_hn(distdf$dist,x$lsig)))})
ggplot(data.frame(probability=apply(distpred,1, mean),
LCL=apply(distpred,1, quantile, probs=c(0.025)),
UCL=apply(distpred,1, quantile, probs=c(0.975)),
distance=distdf$distance),
aes(y=probability,x=distance,ymax=UCL,ymin=LCL)) +
geom_line() + geom_ribbon(alpha=0.4)
Below is a chunk of code defining a function (GRF_sim()) that simulates and plots a single Gaussian random field on the unit square. You are asked to specify the range and marginal standard deviation:
# Specify spatial range and variance
sim_range <- 0.1
sim_sd <- 2
# Function for simulating and plotting a GRF on a unit square
GRF_sim <- function()
{
# define inla mesh
mesh <- inla.mesh.2d(loc.domain = cbind(c(0,1,1,0),c(0,0,1,1)),
min.angle = 21, max.edge = 0.04, cutoff = 0.02)
plot(mesh)
mesh$n
# define spde
spde_obj <- inla.spde2.pcmatern(mesh, constr = T,
prior.range = c(sim_range, 0.5),
prior.sigma = c(sim_sd, 0.5))
# define indices
index <- inla.spde.make.index('smooth',n.spde = spde_obj$n.spde)
# define prediction matrix
A_proj_pred <- inla.spde.make.A(mesh, loc = mesh$loc[,c(1,2)])
# simulate GRF
Q = inla.spde2.precision(spde_obj, theta = c(log(sim_range), log(sim_sd) ))
field <- inla.qsample(Q=Q)
field <- field - mean(field)
plot_pixels <- pixels(mesh, mask=as(owin(),'SpatialPolygons'), nx=150, ny=150)
# predict the values onto a high res grid
A_proj_grid <- inla.spde.make.A(mesh, loc=plot_pixels@coords)
plot_pixels$field <- as.numeric(A_proj_grid %*% field)
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
limits = range(..., na.rm=TRUE))
}
ggplot() + gg(plot_pixels) + colsc(plot_pixels$field) +
ggtitle(paste0('SD = ',sim_sd,' Range = ',sim_range))
}
To simulate a field with the specified SD and range run
GRF_sim()